Summary

Tips form a substantial part of taxi drivers’ income. They also serve as a loose metric of service quality that can be useful to taxi companies. In this report, we analyze factors that affect the tip ratio and train models to predict tips from trip features. For New York City’s taxi drivers, the most important factors are geographical, such as the pickup and dropoff locations, and trip-related, such as trip distance, trip duration and speed. Our analysis also suggests that a higher percentage tip tends to be paid in more affluent parts of the city and during peak hours. We hope that taxi drivers can use this information to plan their daily routines and increase their earnings.

Honor Pledge: On our honor, we pledge that we are the sole authors of this paper and we have accurately cited all help and references used in its completion.

1. Problem Description

1.1 Situation

1.1.A Tasking and Literature

Almost half a million taxi trips are made daily in the city that never sleeps, producing a wealth of information that can prove useful to both passengers and drivers. We seek to understand which features (besides the quality of the driver) factor into the tip received by a cab driver and into whether a trip is tipped at all. We also look at how different factors affect different tipping variables. We use NYC Green Taxi data[1] for September 2015 for this project.

In 2013, the TLC (Taxi and Limousine Commission), under then-mayor Bloomberg, laid out a program introducing a new class of taxis painted ‘apple green’ on the exterior. Green taxis are the second-class citizens in the world of NYC taxis in that they cannot compete with the yellow taxis within the protected ‘yellow zone’ (below E 96th Street and W 110th Street). They are only allowed to pick up passengers outside of this ‘yellow zone’. Green taxis serve New York City by going to places where yellow taxi drivers prefer not to go.

This paper follows the Evidence Informed Systems Engineering methodology to help inform NYC cab drivers, outlining goals, metrics, hypotheses, approach, analysis, evidence and recommendations [2].

One study looks at the percentage of passengers who use the preset tip percentages [3]. It shows that 44% of passengers hit the 20% button for the tip.

Another study[4] examines the relationship between tipping rates and the weather, sunlight in particular, and finds a small but statistically significant positive relationship between sunlight and tipping. A news article[5] indicates that riders’ tips fail to keep pace as fares increase.

Two studies[6] [7] are closely related to our project and use the same data and variables to study the factors that affect tips. Finally, one study[8] compares different classification methods for predicting whether a trip is tipped.

1.1.B Visualizing the data

We use the leaflet package to plot the visualization below. We look at the pickup and dropoff locations for a random sample of 8,000 trips. This should give us a fair idea of where green taxis are ridden to and from.

This shows that trips originate from everywhere except Manhattan. We will address the reasons for that in the data section. Apart from that, we see that the dropoff locations are spread more widely than the pickup locations. The map gives us an idea of what some of our distributions could look like. Let’s start by plotting the target features Tip Amount and Tip Percent.

Fig.2 is a histogram of Tip Percent across all trips with Tip Percent less than 50. The data has a long right tail, so we restrict our analysis here to trips with tip percents below 50%. If untipped trips are excluded, the majority of tips lie between 20% and 25%. The mean tip percentage is 19%, whereas the median tip percentage is 21.5%.

Fig.3 is a plot of Fare amount (Tip excluded) and the Tip Percentage for all the trips. The horizontal lines in the graph correspond to the preset tip percentages (20, 25 and 30) that a passenger sees when they swipe their credit card. The asymptotic curves correspond to common tip values (such as USD10 or USD20).
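As a brief aside on why these two families of lines appear: if a rider leaves a fixed dollar tip T regardless of the fare F (both symbols are our own notation, not names from the data), the tip percent falls off inversely with the fare, whereas a preset percentage p gives a horizontal line:

\[ \text{Tip Percent}(F) = 100 \times \frac{T}{F} \quad \text{(fixed tip } T\text{)}, \qquad \text{Tip Percent}(F) = p \quad \text{(preset percentage } p\text{)} \]

For example, a USD 10 tip on a USD 40 fare corresponds to a tip percent of 25, and the same tip on a USD 50 fare corresponds to 20.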

The following observations can be made from Fig.4:

  1. Higher tip percents are paid for trips with shorter durations, and the tip percent decreases as duration increases.
  2. Most trips have durations of less than 15 minutes. This might be because people prefer taxis for shorter rides and other means of transport for longer rides.
  3. Since shorter trips also have lower fares, paying even a dollar or two more pushes the Tip Percent noticeably higher, whereas riders on longer trips need to pay considerably more to reach the same tip percent.

The heatmap of Tip Percent by hour of the day and day of the week shows:

  1. Tip percents are higher during weekday evenings than on weekends. Drivers receive the highest tips, as a percentage of the fare amount, between 5pm and 12am.
  2. Tip percents after midnight are higher on weekends than on weekdays. This might be because people head home later on weekends, for example after late-night parties.
  3. The worst tip percents occur in the early morning between 5am and 8am, mostly on weekends.
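The table behind such a heatmap can be sketched as a mean of Tip Percent per (hour, day) cell. The toy data frame and its column names (`hour`, `day`, `TipPercent`) are hypothetical stand-ins, not the actual data.

```r
# Hypothetical toy trips: pickup hour, pickup day and tip percent.
trips <- data.frame(
  hour = c(18, 18, 6, 6, 18, 6),
  day  = factor(c("Mon", "Mon", "Sat", "Sat", "Sat", "Mon"),
                levels = c("Mon", "Sat")),
  TipPercent = c(22, 24, 10, 14, 20, 16)
)

# Mean tip percent per (hour, day) cell; a heatmap would colour this matrix.
heat <- tapply(trips$TipPercent,
               list(hour = trips$hour, day = trips$day),
               mean)
print(heat)
```

Each cell of `heat` is one tile of the heatmap, so any plotting function that accepts a matrix could be used to draw it.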

Fig. 6 is a plot of Tip Percentage against average speed, together with the number of trips at each average speed. We find:

  1. Higher tip percentages are associated with average speeds between 10 and 15 mph.
  2. Most trips also have an average speed of 10-15 mph. This may be because most trips take place within the city, in its busiest areas, which have speed restrictions and comparatively heavy traffic.

Here we look at the mean tip percent by borough. We remove the trips for “Staten Island” because of the small number of trips made to and from the borough. The following observations can be made from the plot:

  1. Taxi drivers are paid the lowest Tip Percent when a trip is made within the Bronx. Riders who are dropped off in the Bronx also pay comparatively low tip percents.
  2. People who travel to or from Queens seem to pay a higher tip percent in general than riders in the other boroughs. Specifically, people who ride from Manhattan pay the highest tip percent.
  3. Brooklyn might be the best place to get higher tip percents. Even though the tip percents from riders who are dropped off in Brooklyn aren’t great, people who travel within Brooklyn tend to pay the highest tip percent.
  4. It would be interesting to research who these riders are and what their occupations might be. Having this information would help taxi drivers find the right customers.

1.1.C Correlation Plot

After engineering new features and before starting the modelling, we visualise the relationships between our numeric parameters using a correlation matrix. We could convert all of our features to numeric to include them in the correlation plot, but that would not be meaningful for the categorical features.

We find:

  1. Fare amount and trip distance have the strongest correlation with tip amount, the variable we are interested in, followed by duration and tip percent. This signifies that the tip amount increases with trip distance and fare amount. So it is interesting to see that longer trips are associated with higher tip amounts.

  2. Speed and the number of passengers have little to no correlation with tip amount. One reason for the low correlation between number of passengers and tip amount might be the mix of data: the data seems to be dominated by trips with one or two passengers. It would be interesting to see how the correlation turns out if the mix of passenger counts were more even.

We will keep in mind that trip distance, fare amount and duration are highly correlated when choosing variables for our models.
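A correlation matrix of this kind can be sketched with base R’s `cor()`. The data below is synthetic and the relationships between the columns are made up purely for illustration; only the column names echo the variables discussed above.

```r
set.seed(1)
n <- 500

# Synthetic numeric features (assumed relationships, not the real data).
Trip_distance   <- rexp(n, rate = 1 / 3)
duration        <- 4 * Trip_distance + rnorm(n, sd = 3)
Fare_amount     <- 2.5 + 2.5 * Trip_distance + rnorm(n, sd = 1)
Passenger_count <- sample(1:2, n, replace = TRUE)

num  <- data.frame(Trip_distance, duration, Fare_amount, Passenger_count)

# Pairwise Pearson correlations between the numeric columns.
corr <- cor(num)
print(round(corr, 2))
```

In this toy setting distance, duration and fare are strongly correlated by construction, which mirrors the collinearity concern raised above.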

1.2 Goal

While tips to taxi drivers are usually interpreted as a gesture of gratitude for service, we analyze what other factors can have an impact on tips. For example, do taxi rides starting or ending in richer parts of the city result in better tips? Our main goal is to analyse the factors that affect tips (Tip Percentage) and use these factors to construct models to predict tips. This would help cab drivers and taxi companies identify their high-tipping customers.

1.3 Metrics

Our goal focuses on the tip, and we have a variable Tip Amount in our data. But that alone would not be helpful in learning about riders’ tipping behavior. So we create a new variable, Tip Percent, which expresses the tip ratio as a percentage:

\[ \text{Tip Ratio} = \frac{\text{Tip Amount}}{\text{Fare Amount}}, \qquad \text{Tip Percent} = 100 \times \text{Tip Ratio} \]

where Fare Amount is the fare excluding tips, taxes and tolls.
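As a small worked example of this metric, the sketch below computes both quantities on a toy data frame. The values are made up, and the column names `TipRatio` and `TipPercent` are our own choice for illustration.

```r
# Hypothetical toy trips with fare and tip in USD.
trips <- data.frame(
  Fare_amount = c(10, 20, 8, 15),
  Tip_amount  = c(2, 5, 0, 3)
)

# Tip ratio: tip as a fraction of the fare (fare excludes tips, taxes, tolls).
trips$TipRatio <- trips$Tip_amount / trips$Fare_amount

# Tip percent: the same quantity on a 0-100 scale.
trips$TipPercent <- 100 * trips$TipRatio

print(trips)
```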

1.4 Hypothesis

Based on the literature review and the data analysis, we would like to test whether the dropoff location and the fare amount are statistically significant in predicting the Tip Percent.

In other words, we test whether the change in tip percentage with fare amount differs across boroughs.

2. Approach

2.1 Data

We use the Green Taxi trip data for NYC for September 2015, which has around 1.5 million rows. The data is obtained from the NYC Taxi and Limousine Commission, which also publishes a data dictionary for it. One thing to note is that the 2015 data we use records pickup and dropoff latitude/longitude, whereas the trip data for 2017 records pickup and dropoff location IDs instead.

2.1.A Missing Data

There are 4 missing values for the “trip_type” variable and 6 observations have “99” for RateCodeID which generally indicates missing values. Other than that, the data has no missing values.

2.1.B Data Cleaning

Apart from the missing values as stated above, a few sources of dirtiness in the data are as follows -

  • Unrealistic durations (We extract duration from pickup and dropoff time.)
  • Negative fares or zero trip distance
  • Wrong GPS coordinates (Zero lat/long)
  • Other

For the negative fare amounts, we substitute their absolute values, since the negative signs might be due to human input error, GPS instrument error, etc. We extract the trip duration in minutes from the pickup and dropoff times. We also create a speed variable using duration and trip_distance.
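These cleaning steps can be sketched in base R as follows. The column names `pickup` and `dropoff` are hypothetical placeholders for the timestamp columns, and we assume distance is in miles so that speed comes out in miles per hour.

```r
# Hypothetical toy trips; `pickup`/`dropoff` stand in for the timestamp columns.
trips <- data.frame(
  pickup  = as.POSIXct(c("2015-09-01 08:00:00", "2015-09-01 09:30:00"), tz = "UTC"),
  dropoff = as.POSIXct(c("2015-09-01 08:15:00", "2015-09-01 10:00:00"), tz = "UTC"),
  Fare_amount   = c(-12.5, 30),
  Trip_distance = c(3, 10)
)

# Replace negative fares with their absolute values.
trips$Fare_amount <- abs(trips$Fare_amount)

# Trip duration in minutes from the two timestamps.
trips$duration <- as.numeric(difftime(trips$dropoff, trips$pickup, units = "mins"))

# Average speed in miles per hour (assuming distance is in miles).
trips$speed <- trips$Trip_distance / (trips$duration / 60)

print(trips)
```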

Data Filtering

To filter the data, we take the following steps:

  1. We ignore trips with a duration of less than half a minute or greater than 1 hr. (This excludes trips that took longer than 3 hrs as well as trips that were timed accidentally or recorded in error. Taxi hires that short are unrealistic anyway.)
  2. We remove trips with zero distance or a distance greater than 50 miles. (We restrict ourselves to short-distance trips.)
  3. We ignore trips with pickup/dropoff latitude or longitude values of zero. (These indicate missing lat/long values.)
  4. We remove the observations with NAs (or 99) for RateCodeID and Trip_type.
  5. We remove fare amounts of less than USD 2.5 (the minimum fare).
  6. Finally, we restrict ourselves to trips paid by credit card, because tips are recorded only for credit card payments; cash tips are not included in the dataset. This can be seen by comparing the total tips for cash trips with those for credit card trips: the sum of tips for about 800K cash trips is USD 163, whereas the sum of tips for about 700K credit card trips is around USD 2 million.
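The filtering rules above can be sketched as a single logical mask. The toy data is made up; for brevity only a pickup latitude column is checked for zero coordinates, and we assume `Payment_type == 1` denotes credit card as in the TLC data dictionary. The 1-hour upper limit follows the first list item.

```r
# Hypothetical toy trips, one row per filtering rule that should reject it.
trips <- data.frame(
  duration        = c(0.2, 12, 75, 20, 15, 10),   # minutes
  Trip_distance   = c(1, 3, 9, 0, 4, 2),          # miles
  Pickup_latitude = c(40.7, 40.8, 40.7, 40.7, 0, 40.7),
  Fare_amount     = c(5, 12, 40, 8, 14, 2),       # USD
  Payment_type    = c(1, 1, 1, 1, 1, 2)           # 1 = credit card (assumed coding)
)

keep <- with(trips,
  duration >= 0.5 & duration <= 60 &           # realistic durations
  Trip_distance > 0 & Trip_distance <= 50 &    # non-zero, short-distance trips
  Pickup_latitude != 0 &                       # zero coordinates mean missing GPS
  Fare_amount >= 2.5 &                         # at least the minimum fare
  Payment_type == 1                            # credit card only
)

filtered <- trips[keep, ]
print(filtered)
```

Keeping the rules in one mask makes it easy to count how many rows each condition removes by evaluating the conditions separately.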

Column Selection

We also remove the following redundant columns:

  1. improvement_surcharge, MTA_tax, Extra (These variables have almost the same value for most trips and would just add noise.)
  2. Total_amount (We already have Fare_amount, so this variable is unnecessary.)
  3. Payment_type

Doing all this brings down the number of observations by more than half. We now have data for around 600K trips.

2.1.C Feature Engineering

In this section we build new features from the existing data such as the target variable itself and also new potential predictors for the target variable.

We extract the boroughs of the pickups and dropoffs using a geojson polygon file of New York City. We restrict the boroughs to one of the following five: 1. Manhattan 2. Bronx 3. Brooklyn 4. Queens 5. Staten Island

The geojson file can be obtained from this link.

We could also extract the neighborhoods for the pickups and dropoffs, but there are around 300 of them, and finding the computational resources to use such a variable is difficult. So we limit ourselves to the boroughs.

We also extract the hour of the day and the day of the week for the pickups. These variables might lead us to some useful insights into the tipping habits of riders. We create a “Tip_percent” variable, which gives the tip as a percentage of the fare amount.
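Extracting the hour and weekday from a pickup timestamp can be sketched in base R as below. The timestamps are made up, and we use `as.POSIXlt()` fields rather than `weekdays()` so that the day labels do not depend on the locale.

```r
# Hypothetical pickup timestamps.
pickup <- as.POSIXct(c("2015-09-01 08:05:00", "2015-09-05 23:40:00"), tz = "UTC")

lt <- as.POSIXlt(pickup)

# Hour of the day, 0-23.
hour <- lt$hour

# Day of the week; lt$wday runs from 0 (Sunday) to 6 (Saturday).
day_names <- c("Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat")
weekday   <- factor(day_names[lt$wday + 1], levels = day_names)

print(data.frame(pickup, hour, weekday))
```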

Finally, we:

  1. Limit the speed variable to 100 (an average speed greater than 100 is unrealistic).
  2. Limit TipPercent to 100% (it is rare for someone to tip more than the fare amount).
  3. Remove the observations that do not fall into any of the five boroughs.

There should be little bias in the data, as it was collected electronically without human involvement. However, some unrealistic records caused by recording errors may remain. We have done our best to eliminate bias in the data.

2.2 Analysis

2.2.A Test vs Train Overlap

We create test and training data sets out of the original data. The test data set is 1/6th of the original data set. It will be used later in the analysis to calculate the PMSE for the models. We look at the overlap for a sample test data set.
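A random 1/6 holdout split can be sketched as follows. The data frame is synthetic, and the names `taxi_test` and `taxi_train` are used for illustration (the latter matches the data name in the model call in the appendix).

```r
set.seed(42)
n <- 600

# Synthetic stand-in for the cleaned trips.
trips <- data.frame(id = seq_len(n), TipPercent = runif(n, 0, 50))

# Draw 1/6 of the row indices at random for the test set.
test_idx   <- sample(n, size = round(n / 6))
taxi_test  <- trips[test_idx, ]
taxi_train <- trips[-test_idx, ]

# Quick overlap check: the two sets should span a similar range of the target.
print(rbind(train = range(taxi_train$TipPercent),
            test  = range(taxi_test$TipPercent)))
```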

To make sure that we are training on features that are relevant to our test data set, we briefly compare the temporal and spatial properties of the train and test data. This is another consistency check. Here are two relevant comparison plots.

We find that our train and test data sets do indeed cover the same time range and geographical area.

2.2.B Models

We use linear regression to fit a model for our hypothesis. From the initial visualizations, the fare amount and the pickup and dropoff locations seem to have an effect on Tip Percent. The first linear model we fit focuses on how the pickup and dropoff locations, together with quantitative variables of interest such as fare amount and passenger count, affect the Tip Percent. We do not use trip distance or duration as predictors because of their high correlation with fare amount, which is already a predictor. We have no preference for our base case, so we proceed with the default.

We started with a first order linear model for our hypothesis.

TipPercent~boroughCode_d+boroughCode_p+Fare_amount+Passenger_count

  • TipPercent: Tip as a percentage of the fare amount
  • boroughCode_d: Dropoff borough for the trip
  • boroughCode_p: Pickup borough for the trip
  • Fare_amount: Fare for the trip excluding tips, taxes and tolls
  • Passenger_count: Number of passengers on the trip

We see that, relative to the Manhattan base case, all the terms are statistically significant in predicting TipPercent except those for Staten Island. The small number of trips to and from Staten Island might be the reason for this insignificance.

We went through the process of model assessment, automated variable selection through stepwise regression, and model comparison with another model that includes all the two-way interaction terms:

TipPercent~(boroughCode_d+boroughCode_p+Fare_amount+Passenger_count)^2
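A minimal sketch of fitting the two formulas with `lm()` follows. The data is synthetic: `make_taxi()` is a hypothetical generator with made-up effect sizes, Staten Island is left out for simplicity, and Manhattan is set as the first factor level so that it acts as the base case.

```r
# Hypothetical generator for synthetic trips; all effect sizes are made up.
make_taxi <- function(n) {
  boroughs <- c("Manhattan", "Bronx", "Brooklyn", "Queens")
  d <- data.frame(
    boroughCode_d   = factor(sample(boroughs, n, replace = TRUE), levels = boroughs),
    boroughCode_p   = factor(sample(boroughs, n, replace = TRUE), levels = boroughs),
    Fare_amount     = runif(n, 3, 40),
    Passenger_count = sample(1:4, n, replace = TRUE)
  )
  # Assumed borough-specific fare slopes, for illustration only.
  slope <- c(Manhattan = -0.10, Bronx = -0.20, Brooklyn = -0.05, Queens = -0.15)
  d$TipPercent <- 20 +
    unname(slope[as.character(d$boroughCode_d)]) * d$Fare_amount +
    rnorm(n, sd = 5)
  d
}

set.seed(7)
taxi <- make_taxi(2000)

# First-order model (main effects only).
lm1 <- lm(TipPercent ~ boroughCode_d + boroughCode_p + Fare_amount + Passenger_count,
          data = taxi)

# Second-order model: main effects plus all two-way interactions.
lm2 <- lm(TipPercent ~ (boroughCode_d + boroughCode_p + Fare_amount + Passenger_count)^2,
          data = taxi)

print(length(coef(lm1)))
print(length(coef(lm2)))
```

The `^2` operator expands to every pairwise interaction, which is why the second model has many more coefficients than the first.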

We find that both models, the first-order model and the second-order model with interaction terms, have significant p-values for the F-statistic. However, we face some problems with the boroughs while fitting our models. The number of trips to and from Staten Island is very low, so a few of the interaction terms turned out to be NAs. We therefore leave Staten Island out when we discuss our models. Our model shows a significant difference in TipPercent between trips taken to and from Manhattan and trips involving other boroughs. Even though the interaction terms between boroughCode_p/d and Fare_amount have significant estimates for all boroughs other than Staten Island, the estimates are small compared to those of the interaction terms between boroughCode_p and boroughCode_d. The interaction terms between Passenger_count and boroughCode_p are significantly different from that of Manhattan only for the Bronx and Brooklyn, and these estimates are also small compared to the interactions between boroughCode_p and boroughCode_d.

LM1: TipPercent~boroughCode_d+boroughCode_p+Fare_amount+Passenger_count
LM2: TipPercent~(boroughCode_d+boroughCode_p+Fare_amount+Passenger_count)^2

The model utility test showed that both models are statistically significant. Next, we applied stepwise regression, an automated variable selection method, to reduce the complexity of linear model 2. The resulting model retained all the interactions except those between boroughCode_d and Passenger_count and between Fare_amount and Passenger_count.

LM2 (After Step): TipPercent~boroughCode_d+boroughCode_p+Fare_amount+Passenger_count+boroughCode_d:boroughCode_p+boroughCode_d:Fare_amount+boroughCode_p:Fare_amount+boroughCode_p:Passenger_count
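Stepwise selection of this kind can be sketched with base R’s `step()`, which drops terms as long as doing so lowers the AIC. The data is synthetic (the hypothetical `make_taxi()` generator uses made-up effects), so the terms retained here need not match those retained on the real data.

```r
# Hypothetical generator for synthetic trips; all effect sizes are made up.
make_taxi <- function(n) {
  boroughs <- c("Manhattan", "Bronx", "Brooklyn", "Queens")
  d <- data.frame(
    boroughCode_d   = factor(sample(boroughs, n, replace = TRUE), levels = boroughs),
    boroughCode_p   = factor(sample(boroughs, n, replace = TRUE), levels = boroughs),
    Fare_amount     = runif(n, 3, 40),
    Passenger_count = sample(1:4, n, replace = TRUE)
  )
  slope <- c(Manhattan = -0.10, Bronx = -0.20, Brooklyn = -0.05, Queens = -0.15)
  d$TipPercent <- 20 +
    unname(slope[as.character(d$boroughCode_d)]) * d$Fare_amount +
    rnorm(n, sd = 5)
  d
}

set.seed(11)
taxi <- make_taxi(2000)

lm2 <- lm(TipPercent ~ (boroughCode_d + boroughCode_p + Fare_amount + Passenger_count)^2,
          data = taxi)

# With no scope given, step() searches backward from the full model.
lm2_step <- step(lm2, trace = 0)

print(formula(lm2_step))
```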

Next, in the table below, we compare linear model 1 (main effects) and linear model 2 (interaction terms, after stepwise regression).

Model             Adj R^2   AIC
LM1               0.0480    4226076
LM2(After Step)   0.0572    4220653

For criterion-based assessment, we compare the models using adjusted R^2 and AIC. Model 2 seems to be the better choice in terms of goodness of fit balanced against model complexity.

Next, we conducted a partial F-test to compare the two models. The test supported model 2: it revealed that adding the interaction terms to the model provides statistically significant information, with a p-value of 2.2e-16.
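The criterion-based comparison and the partial F-test can be sketched as follows. The data is synthetic (hypothetical `make_taxi()` generator with made-up effects), and for simplicity the sketch compares the main-effects model with the full interaction model rather than the stepped one, so that the two models are guaranteed to be nested.

```r
# Hypothetical generator for synthetic trips; all effect sizes are made up.
make_taxi <- function(n) {
  boroughs <- c("Manhattan", "Bronx", "Brooklyn", "Queens")
  d <- data.frame(
    boroughCode_d   = factor(sample(boroughs, n, replace = TRUE), levels = boroughs),
    boroughCode_p   = factor(sample(boroughs, n, replace = TRUE), levels = boroughs),
    Fare_amount     = runif(n, 3, 40),
    Passenger_count = sample(1:4, n, replace = TRUE)
  )
  slope <- c(Manhattan = -0.10, Bronx = -0.20, Brooklyn = -0.05, Queens = -0.15)
  d$TipPercent <- 20 +
    unname(slope[as.character(d$boroughCode_d)]) * d$Fare_amount +
    rnorm(n, sd = 5)
  d
}

set.seed(5)
taxi <- make_taxi(2000)

lm1 <- lm(TipPercent ~ boroughCode_d + boroughCode_p + Fare_amount + Passenger_count,
          data = taxi)
lm2 <- lm(TipPercent ~ (boroughCode_d + boroughCode_p + Fare_amount + Passenger_count)^2,
          data = taxi)

# Criterion-based comparison: adjusted R^2 (higher is better), AIC (lower is better).
comparison <- data.frame(
  Model = c("LM1", "LM2"),
  AdjR2 = c(summary(lm1)$adj.r.squared, summary(lm2)$adj.r.squared),
  AIC   = c(AIC(lm1), AIC(lm2))
)
print(comparison)

# Partial F-test: do the added interaction terms jointly improve the fit?
ftest   <- anova(lm1, lm2)
p_value <- ftest[2, "Pr(>F)"]
print(ftest)
```

A small `p_value` indicates that the interaction terms add information beyond the main effects, which is the conclusion the text draws for the real data.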

Finally, we compared the models via the train/test method. We separated the data into a train set and a test set. The train set comprised 5/6 of the data; the test set comprised the remaining 1/6. The sets were checked to ensure they were random and representative. Linear models 1 and 2 were built on the train sets and evaluated on the test sets. Twenty iterations were executed, and the mean of the predicted mean square errors (PMSE) was recorded to compare the performance of the two models.

The plot below illustrates the PMSE for each model at each iteration. The table summarises the mean and median PMSE for each model.
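The repeated train/test procedure can be sketched as below: each iteration draws a fresh 1/6 holdout, fits both models on the remainder, and records each model’s PMSE on the holdout. The data is synthetic (hypothetical `make_taxi()` generator), so the resulting numbers are not comparable to those reported for the real data.

```r
# Hypothetical generator for synthetic trips; all effect sizes are made up.
make_taxi <- function(n) {
  boroughs <- c("Manhattan", "Bronx", "Brooklyn", "Queens")
  d <- data.frame(
    boroughCode_d   = factor(sample(boroughs, n, replace = TRUE), levels = boroughs),
    boroughCode_p   = factor(sample(boroughs, n, replace = TRUE), levels = boroughs),
    Fare_amount     = runif(n, 3, 40),
    Passenger_count = sample(1:4, n, replace = TRUE)
  )
  slope <- c(Manhattan = -0.10, Bronx = -0.20, Brooklyn = -0.05, Queens = -0.15)
  d$TipPercent <- 20 +
    unname(slope[as.character(d$boroughCode_d)]) * d$Fare_amount +
    rnorm(n, sd = 5)
  d
}

set.seed(9)
n    <- 2000
taxi <- make_taxi(n)

formulas <- list(
  LM1 = TipPercent ~ boroughCode_d + boroughCode_p + Fare_amount + Passenger_count,
  LM2 = TipPercent ~ (boroughCode_d + boroughCode_p + Fare_amount + Passenger_count)^2
)

# Twenty iterations; both models share the same split within an iteration.
results <- t(replicate(20, {
  idx   <- sample(n, size = round(n / 6))
  train <- taxi[-idx, ]
  test  <- taxi[idx, ]
  sapply(formulas, function(f) {
    fit <- lm(f, data = train)
    mean((test$TipPercent - predict(fit, newdata = test))^2)  # PMSE on holdout
  })
}))

print(colMeans(results))
print(apply(results, 2, median))
```

Using the same split for both models within an iteration makes the comparison paired, so differences in PMSE reflect the models rather than the particular holdout drawn.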

Line   Model             Mean PMSE   Median PMSE
Blue   LM1               92.725      92.758
Red    LM2(After Step)   91.902      91.942

Model 2 has a lower PMSE by a small margin, suggesting that it has slightly better predictive accuracy.

In conclusion, AIC, the partial F-test, adjusted R^2 and the PMSE all suggest that model 2 is the better linear model. We also sought to capitalize on the analytical granularity that the more complex model offers. The insights on interactions will help us provide more targeted recommendations to the stakeholders.

3. Evidence

After conducting the model assessments and comparisons for our hypothesis, we move on to evaluating the model. The following model is evaluated with diagnostics:

Hypothesis: Step of “TipPercent~(boroughCode_d+boroughCode_p+Fare_amount+Passenger_count)^2”

3.1 Diagnostics for Hypothesis

Below are the diagnostic plots for our hypothesis

Reviewing the Residuals vs. Fitted plot and the Scale-Location plot, the error seems to be random, although the fitted values are concentrated in a small range. This concentration might be due to the low predictive power of the model. The Normal Q-Q plot reveals that the tails diverge, especially on the right-hand side. This illustrates the limits of the model at low and high values. The Residuals vs. Leverage plot reveals three possible influential points. After investigating these points, we decided to remove them from the data: they were normal trips but with a high TipPercent, and removing them would probably improve model performance.
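One way to flag influential points numerically, alongside the Residuals vs. Leverage plot, is Cook’s distance. The sketch below uses a simplified one-predictor model on synthetic data with two planted high-tip trips; it illustrates the idea and is not the procedure used on the real data.

```r
set.seed(3)
n <- 300

# Synthetic trips with a mild negative fare effect (made-up numbers).
taxi <- data.frame(Fare_amount = runif(n, 3, 40))
taxi$TipPercent <- 20 - 0.1 * taxi$Fare_amount + rnorm(n, sd = 4)

# Plant two otherwise ordinary trips with a very high tip percent (rows 1 and 2).
taxi$Fare_amount[1:2] <- 39
taxi$TipPercent[1:2]  <- 100

fit <- lm(TipPercent ~ Fare_amount, data = taxi)

# Cook's distance combines residual size and leverage for every observation.
cd <- cooks.distance(fit)
influential <- order(cd, decreasing = TRUE)[1:3]
print(influential)

# Refit without the flagged points and compare the residual standard error.
refit <- lm(TipPercent ~ Fare_amount, data = taxi[-influential, ])
print(c(before = summary(fit)$sigma, after = summary(refit)$sigma))
```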

Linear model 2 is the model we apply to our hypothesis. The model is statistically significant with a p-value of 2.2e-16, as shown by the model utility test. The complete summary of the model can be found in the appendix.

Overall, we find that the model for our hypothesis is not particularly good at prediction, but important insights can be drawn from it. It also supports our hypothesis. The model shows that the pickup and dropoff boroughs are statistically significant. It also shows the significance of fare amount and passenger count in predicting the tip percentage, even though the accuracy of the model is low.

Going back to the model, we see that trips that originate from any borough but end in the Bronx have comparatively lower tipping rates than our base case, Manhattan. This statement is supported by the estimates and p-values for the interaction terms in our second linear model. All the other trips have a comparatively higher Tip Percent than trips that originate from and end in Manhattan, although the size of the difference varies between pairs of boroughs. The only other exception is trips that originate from the Bronx and end in Brooklyn, which also have a lower tip percentage.

Fare_amount is found to be statistically significant in predicting the tip percent. The model shows that, relative to Manhattan, tip percent decreases with increasing fare amount when the pickup or dropoff location is Queens. Relative to Manhattan, tip percent also decreases with increasing fare amount if the dropoff location is the Bronx or Brooklyn, but increases if riders are picked up from these locations. However, the interaction terms with the pickup/dropoff boroughs show little to no difference in the tip percentage, even though that difference is statistically significant. In practice, this cannot be translated into an actionable recommendation.

Passenger count is found to be statistically significant as well. Relative to Manhattan, tip percent increases with the number of passengers if the trip originates from the Bronx and decreases if the trip originates in Brooklyn. However, the passenger count variable is not normally distributed, and the majority of trips have 1 or 2 passengers, so this cannot be relied upon for making recommendations or statements about the effect of the number of passengers on tip percentage.

3.2 Comparing to Random Forest

To compare the linear model’s prediction accuracy with that of a more sophisticated model, we train a random forest model and calculate the MSE. Random forests address the problem of high variance from which decision trees and bagging suffer. They do this by decorrelating the trees: at every split, the best variable is chosen from a new random subset of the variables. We now train a regression random forest model.

The above plot shows the variables that are considered important in predicting the Tip Percent variable. It shows that trip distance, duration and speed are the most important variables in the data set for predicting the target variable. It also shows that hour of the day and the fare amount play an important role. This does not explain whether tip percentage increases or decreases; rather, it identifies the variables that are related to changes in tip percentage. Using 50 trees because of limited computational power, we obtain an MSE of 111.5. This may be due to the small number of trees used, or because the random forest overcomplicates a simple model.

4. Recommendations/Conclusions

Although the model was statistically significant in predicting the tip percent, its MSE is high, and hence the model cannot be used for prediction of tip percent. We could further decrease the error if we had information about the rider and driver for each trip. The tipping habits of riders and the quality of service of drivers matter for tipping at least as much as pickup locations, etc. However, the model gives some useful insights into what factors into the tipping behavior of NYC taxi riders, such as the pickup and dropoff locations, duration of trip, average speed and trip distance. The models show that picking up in a particular borough at a particular time can lead to an increased tip percentage. This information is useful to cab drivers and taxi companies if the goal is to increase the tip percent of a trip.

We conclude that viewing tips as solely a measure of service quality can be misleading. A large number of factors such as location, trip distance, trip duration and fare have a significant effect on tip. There are also some temporal effects on tip, such as the time of day and day of week.

4.1 Recommendations for Follow-on Analysis

Even though our model did not turn out to be of much help for prediction, further research in this direction can lead to more useful insights. A few suggestions are as follows.

  1. We can extend our analysis by introducing external data such as weather data, which can show how changes in the weather change riders’ tipping behavior. This data can be extracted from the National Weather Service Forecast Office website. It would be interesting to see whether rainy and snowy days change how riders tip.

  2. Due to computational limitations, we did not have a chance to use the neighborhood data of pickups and dropoffs. There were around 300 neighborhood categories in our data, and it was not feasible for us to use a variable with 300 categories. So we used the boroughs instead. We believe using neighborhoods would be a great way to see how tipping behavior changes across neighborhoods or which areas get the most tips. The shapefile for the neighborhoods can be found at Zillow.

  3. Instead of looking at whether a ride results in a tip or not, it would be more interesting to look at how these variables affect the tip as a percentage of the fare amount. We could use regression models instead of classification models to achieve this. It would be more helpful and meaningful for a regular taxi driver to know which places or times get more tips than to know simply whether a trip gets a tip or not.

  4. We can compare the tipping behaviors of the riders for the yellow taxis and the green taxis. We can even introduce Uber and Lyft data to differentiate between the services.

  5. Finally, due to computational limitations, we had to limit our data to a month for exploratory data analysis and to a day for modelling and prediction. The analysis would be more accurate if we could look at data for a year or more, where the season or month might be a potential predictor.

5. References

[1] NYC Taxi and Limousine Commission website. Available at: http://www.nyc.gov/html/tlc/html/about/trip_record_data.shtml
[2] “Project 1 template,” class template in SYS 4021, 2017.
[3] Analysis of NYC Taxi Tip Data: 44% of Passengers Hit the 20% Button. Available at: http://dfkoz.tumblr.com/post/106719206826/analysis-of-nyc-taxi-tip-data-44-of-passengers
[4] Taxicab Tipping and Sunlight. Available at: http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0179193
[5] Flegenheimer, M. As Taxi Fares Increase, Riders’ Tips Fail to Keep Pace (2013). Available at: http://www.nytimes.com/2013/01/03/nyregion/as-taxi-fares-rise-riders-tips-dont-keep-pace.html
[6] Gupta, S. and Nagda, R. NYC Taxi Tip-Rate Prediction. Available at: https://cseweb.ucsd.edu/classes/wi17/cse258-a/reports/a075.pdf
[7] Jain, S. and See, A. Predicting Taxi Tip-Rates in NYC. Available at: https://cseweb.ucsd.edu/~jmcauley/cse190/reports/sp15/050.pdf
[8] Chandrasekharan, R. Predicting NYC Taxi Tips using MicrosoftML. Available at: https://blogs.msdn.microsoft.com/microsoftrservertigerteam/2017/01/17/predicting-nyc-taxi-tips-using-microsoftml/

5.A Appendix A

Summary of our chosen linear model with interaction terms. It is the stepped version of the model, which performed best among the linear models we compared.

summary(lmtip2.step)
## 
## Call:
## lm(formula = TipPercent ~ boroughCode_d + boroughCode_p + Fare_amount + 
##     Passenger_count + boroughCode_d:boroughCode_p + boroughCode_d:Fare_amount + 
##     boroughCode_p:Fare_amount + boroughCode_p:Passenger_count, 
##     data = taxi_train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -25.121  -4.890   2.254   5.279  91.319 
## 
## Coefficients: (2 not defined because of singularities)
##                                                         Estimate
## (Intercept)                                            19.789600
## boroughCode_dBronx                                     -4.480369
## boroughCode_dBrooklyn                                   7.918896
## boroughCode_dQueens                                     6.596503
## boroughCode_dStaten Island                             10.462783
## boroughCode_pBronx                                     -6.548829
## boroughCode_pBrooklyn                                   1.403089
## boroughCode_pQueens                                     3.227094
## boroughCode_pStaten Island                             32.115335
## Fare_amount                                            -0.143330
## Passenger_count                                         0.073671
## boroughCode_dBronx:boroughCode_pBronx                   2.771182
## boroughCode_dBrooklyn:boroughCode_pBronx                2.775003
## boroughCode_dQueens:boroughCode_pBronx                 -0.204426
## boroughCode_dStaten Island:boroughCode_pBronx                 NA
## boroughCode_dBronx:boroughCode_pBrooklyn                5.899935
## boroughCode_dBrooklyn:boroughCode_pBrooklyn            -5.529947
## boroughCode_dQueens:boroughCode_pBrooklyn              -4.600782
## boroughCode_dStaten Island:boroughCode_pBrooklyn       -5.723265
## boroughCode_dBronx:boroughCode_pQueens                  6.216713
## boroughCode_dBrooklyn:boroughCode_pQueens              -5.327909
## boroughCode_dQueens:boroughCode_pQueens                -8.050386
## boroughCode_dStaten Island:boroughCode_pQueens          3.815404
## boroughCode_dBronx:boroughCode_pStaten Island                 NA
## boroughCode_dBrooklyn:boroughCode_pStaten Island      -12.611246
## boroughCode_dQueens:boroughCode_pStaten Island         -4.378478
## boroughCode_dStaten Island:boroughCode_pStaten Island -29.064323
## boroughCode_dBronx:Fare_amount                         -0.066168
## boroughCode_dBrooklyn:Fare_amount                      -0.154800
## boroughCode_dQueens:Fare_amount                        -0.047943
## boroughCode_dStaten Island:Fare_amount                 -0.006344
## boroughCode_pBronx:Fare_amount                          0.120546
## boroughCode_pBrooklyn:Fare_amount                       0.026078
## boroughCode_pQueens:Fare_amount                        -0.028829
## boroughCode_pStaten Island:Fare_amount                 -0.347113
## boroughCode_pBronx:Passenger_count                      0.293396
## boroughCode_pBrooklyn:Passenger_count                  -0.025937
## boroughCode_pQueens:Passenger_count                    -0.006958
## boroughCode_pStaten Island:Passenger_count             -3.319185
##                                                       Std. Error t value
## (Intercept)                                             0.056495 350.289
## boroughCode_dBronx                                      0.192912 -23.225
## boroughCode_dBrooklyn                                   0.323464  24.482
## boroughCode_dQueens                                     0.190727  34.586
## boroughCode_dStaten Island                              8.121904   1.288
## boroughCode_pBronx                                      0.238252 -27.487
## boroughCode_pBrooklyn                                   0.109577  12.805
## boroughCode_pQueens                                     0.147140  21.932
## boroughCode_pStaten Island                             26.658061   1.205
## Fare_amount                                             0.003371 -42.515
## Passenger_count                                         0.025285   2.914
## boroughCode_dBronx:boroughCode_pBronx                   0.198820  13.938
## boroughCode_dBrooklyn:boroughCode_pBronx                0.776802   3.572
## boroughCode_dQueens:boroughCode_pBronx                  0.444927  -0.459
## boroughCode_dStaten Island:boroughCode_pBronx                 NA      NA
## boroughCode_dBronx:boroughCode_pBrooklyn                0.647630   9.110
## boroughCode_dBrooklyn:boroughCode_pBrooklyn             0.287853 -19.211
## boroughCode_dQueens:boroughCode_pBrooklyn               0.159526 -28.840
## boroughCode_dStaten Island:boroughCode_pBrooklyn        5.287083  -1.082
## boroughCode_dBronx:boroughCode_pQueens                  0.416625  14.922
## boroughCode_dBrooklyn:boroughCode_pQueens               0.298384 -17.856
## boroughCode_dQueens:boroughCode_pQueens                 0.166710 -48.290
## boroughCode_dStaten Island:boroughCode_pQueens          5.486773   0.695
## boroughCode_dBronx:boroughCode_pStaten Island                 NA      NA
## boroughCode_dBrooklyn:boroughCode_pStaten Island       11.647804  -1.083
## boroughCode_dQueens:boroughCode_pStaten Island         23.462331  -0.187
## boroughCode_dStaten Island:boroughCode_pStaten Island  21.000287  -1.384
## boroughCode_dBronx:Fare_amount                          0.010907  -6.067
## boroughCode_dBrooklyn:Fare_amount                       0.005122 -30.221
## boroughCode_dQueens:Fare_amount                         0.004826  -9.934
## boroughCode_dStaten Island:Fare_amount                  0.094514  -0.067
## boroughCode_pBronx:Fare_amount                          0.009750  12.363
## boroughCode_pBrooklyn:Fare_amount                       0.005012   5.203
## boroughCode_pQueens:Fare_amount                         0.005563  -5.183
## boroughCode_pStaten Island:Fare_amount                  0.475306  -0.730
## boroughCode_pBronx:Passenger_count                      0.073164   4.010
## boroughCode_pBrooklyn:Passenger_count                   0.030848  -0.841
## boroughCode_pQueens:Passenger_count                     0.035395  -0.197
## boroughCode_pStaten Island:Passenger_count              2.283977  -1.453
##                                                       Pr(>|t|)    
## (Intercept)                                            < 2e-16 ***
## boroughCode_dBronx                                     < 2e-16 ***
## boroughCode_dBrooklyn                                  < 2e-16 ***
## boroughCode_dQueens                                    < 2e-16 ***
## boroughCode_dStaten Island                            0.197671    
## boroughCode_pBronx                                     < 2e-16 ***
## boroughCode_pBrooklyn                                  < 2e-16 ***
## boroughCode_pQueens                                    < 2e-16 ***
## boroughCode_pStaten Island                            0.228314    
## Fare_amount                                            < 2e-16 ***
## Passenger_count                                       0.003573 ** 
## boroughCode_dBronx:boroughCode_pBronx                  < 2e-16 ***
## boroughCode_dBrooklyn:boroughCode_pBronx              0.000354 ***
## boroughCode_dQueens:boroughCode_pBronx                0.645905    
## boroughCode_dStaten Island:boroughCode_pBronx               NA    
## boroughCode_dBronx:boroughCode_pBrooklyn               < 2e-16 ***
## boroughCode_dBrooklyn:boroughCode_pBrooklyn            < 2e-16 ***
## boroughCode_dQueens:boroughCode_pBrooklyn              < 2e-16 ***
## boroughCode_dStaten Island:boroughCode_pBrooklyn      0.279031    
## boroughCode_dBronx:boroughCode_pQueens                 < 2e-16 ***
## boroughCode_dBrooklyn:boroughCode_pQueens              < 2e-16 ***
## boroughCode_dQueens:boroughCode_pQueens                < 2e-16 ***
## boroughCode_dStaten Island:boroughCode_pQueens        0.486816    
## boroughCode_dBronx:boroughCode_pStaten Island               NA    
## boroughCode_dBrooklyn:boroughCode_pStaten Island      0.278936    
## boroughCode_dQueens:boroughCode_pStaten Island        0.851961    
## boroughCode_dStaten Island:boroughCode_pStaten Island 0.166360    
## boroughCode_dBronx:Fare_amount                        1.31e-09 ***
## boroughCode_dBrooklyn:Fare_amount                      < 2e-16 ***
## boroughCode_dQueens:Fare_amount                        < 2e-16 ***
## boroughCode_dStaten Island:Fare_amount                0.946486    
## boroughCode_pBronx:Fare_amount                         < 2e-16 ***
## boroughCode_pBrooklyn:Fare_amount                     1.96e-07 ***
## boroughCode_pQueens:Fare_amount                       2.19e-07 ***
## boroughCode_pStaten Island:Fare_amount                0.465210    
## boroughCode_pBronx:Passenger_count                    6.07e-05 ***
## boroughCode_pBrooklyn:Passenger_count                 0.400464    
## boroughCode_pQueens:Passenger_count                   0.844148    
## boroughCode_pStaten Island:Passenger_count            0.146156    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.576 on 561692 degrees of freedom
## Multiple R-squared:  0.0599, Adjusted R-squared:  0.05984 
## F-statistic: 994.2 on 36 and 561692 DF,  p-value: < 2.2e-16
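Two interaction rows in the output above (Staten Island drop-off with Bronx pickup, and Bronx drop-off with Staten Island pickup, assuming `_d` and `_p` denote drop-off and pickup borough) have `NA` estimates. The most likely explanation is that the training data contain no trips for those borough pairs, so `lm` cannot estimate the corresponding coefficients. The sketch below reproduces the behaviour on simulated data; the data frame `toy` and its columns are hypothetical and are not part of the taxi data set.

```r
# Toy illustration (simulated data, hypothetical names): an interaction
# cell with no observations yields an NA coefficient in lm().
set.seed(1)

# Three pickup zones x three drop-off zones, five trips per pair.
toy <- expand.grid(pickup  = c("A", "B", "C"),
                   dropoff = c("A", "B", "C"),
                   rep     = 1:5)

# Remove every trip from pickup C to drop-off B, leaving that cell empty.
toy <- subset(toy, !(pickup == "C" & dropoff == "B"))

# Simulated tip percentages; the values themselves are irrelevant here.
toy$tip <- rnorm(nrow(toy), mean = 20, sd = 5)

fit <- lm(tip ~ pickup * dropoff, data = toy)

# The coefficient for the empty cell is reported as NA (not estimable).
print(coef(fit)[is.na(coef(fit))])
```

In a full `summary()` printout, such rows are usually flagged in the coefficient header as "not defined because of singularities". They do not affect the remaining estimates.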

This is the main-effects model against which we compared our best model:

summary(lmtip1)
## 
## Call:
## lm(formula = TipPercent ~ boroughCode_d + boroughCode_p + Fare_amount + 
##     Passenger_count, data = taxi_train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -25.834  -4.848   2.227   5.280  55.157 
## 
## Coefficients:
##                             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                20.004213   0.034874 573.619  < 2e-16 ***
## boroughCode_dBronx         -4.814067   0.080012 -60.167  < 2e-16 ***
## boroughCode_dBrooklyn       0.443913   0.042319  10.490  < 2e-16 ***
## boroughCode_dQueens         0.551428   0.048726  11.317  < 2e-16 ***
## boroughCode_dStaten Island  6.107866   1.038597   5.881 4.08e-09 ***
## boroughCode_pBronx         -2.883826   0.087634 -32.907  < 2e-16 ***
## boroughCode_pBrooklyn       1.782845   0.043224  41.247  < 2e-16 ***
## boroughCode_pQueens         0.652381   0.051720  12.614  < 2e-16 ***
## boroughCode_pStaten Island  0.251040   2.825548   0.089    0.929    
## Fare_amount                -0.147089   0.001484 -99.149  < 2e-16 ***
## Passenger_count             0.072188   0.012358   5.841 5.18e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.621 on 561718 degrees of freedom
## Multiple R-squared:  0.05096,    Adjusted R-squared:  0.05094 
## F-statistic:  3016 on 10 and 561718 DF,  p-value: < 2.2e-16
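Because the main-effects model is nested in the interaction model and both were fitted to the same 561,729 observations (the residual and model degrees of freedom add up to the same total), the two fits can be compared with a partial F-test. The sketch below rebuilds that test from the rounded figures printed in the two summaries, so the result is approximate. With both fitted objects in memory, `anova(lmtip1, lmtip_best)` would give the exact test, where `lmtip_best` is a hypothetical name for the interaction model (its actual name is not shown in this section).

```r
# Approximate partial F-test, using only the rounded values printed in
# the two summary() outputs above.
r2_full    <- 0.0599    # Multiple R-squared, interaction model
r2_reduced <- 0.05096   # Multiple R-squared, main-effects model (lmtip1)
df_full    <- 561692    # residual degrees of freedom, interaction model
df_reduced <- 561718    # residual degrees of freedom, main-effects model

# Number of additional coefficients estimated by the interaction model.
q <- df_reduced - df_full

# Partial F statistic for nested linear models fitted to the same data.
f_stat <- ((r2_full - r2_reduced) / q) / ((1 - r2_full) / df_full)
p_val  <- pf(f_stat, q, df_full, lower.tail = FALSE)

cat("extra terms:", q, "\n")
cat("partial F  :", round(f_stat, 1), "\n")
cat("p-value    :", format.pval(p_val), "\n")
```

With more than half a million observations, even a gain of under one percentage point in R-squared is highly significant statistically. The practical improvement is modest: the residual standard error falls only from 9.621 to 9.576.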

Finally, this is the summary of the random forest model:

print(random.forest.model)
## 
## Call:
##  randomForest(x = sept15_train[, -c(15)], y = sept15_train$TipPercent,      ntree = 50) 
##                Type of random forest: regression
##                      Number of trees: 50
## No. of variables tried at each split: 4
## 
##           Mean of squared residuals: 100.5635
##                     % Var explained: 0.07
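To put the random forest output on the same scale as the linear models, the sketch below converts the reported mean of squared residuals into a root-mean-square error and sets it beside the residual standard errors printed earlier. Two caveats apply. The forest was trained on `sept15_train` and the linear models on `taxi_train`, so the figures are not strictly comparable. In addition, `randomForest` reports out-of-bag error, whereas the residual standard error from `lm` is an in-sample quantity. Note also that `% Var explained` is already expressed in percent, so 0.07 means 0.07%, not 7%.

```r
# Values copied from the printed output above (rounded as printed).
rf_mse     <- 100.5635  # "Mean of squared residuals" (out-of-bag)
rf_pct_var <- 0.07      # "% Var explained", already in percent
lm_rse     <- c(interaction = 9.576, main_effects = 9.621)
lm_r2      <- c(interaction = 0.0599, main_effects = 0.05096)

# Root-mean-square error of the forest, in TipPercent units.
rf_rmse <- sqrt(rf_mse)

# randomForest reports 100 * (1 - MSE / Var(y)); inverting this gives
# the response variance implied by the two printed numbers.
implied_var_y <- rf_mse / (1 - rf_pct_var / 100)

comparison <- data.frame(
  model           = c("random forest", names(lm_rse)),
  rmse_or_rse     = c(rf_rmse, lm_rse),
  pct_var_explain = c(rf_pct_var, 100 * lm_r2)
)
print(comparison, row.names = FALSE)
cat("implied Var(TipPercent):", round(implied_var_y, 2), "\n")
```

On these figures the 50-tree forest explains almost none of the variance in `TipPercent`, while the linear models explain roughly 5 to 6 percent, subject to the caveats above.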